library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
##
## groups
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
##
## slice
## The following object is masked from 'package:dplyr':
##
## slice
library(archdata)
library(caret)
## Loading required package: lattice
library(Ckmeans.1d.dp)
library(magrittr)
Part 1
Let \(C\) represent the classification variable and \(c\) be a value of \(C\); let us consider just two classes \(\{-,+\}\). According to Bayes' rule, the probability of a sample \(\mathbf{X}=(x_1,x_2,\dots,x_n)\) being of class \(c\) is: \[p(c|\mathbf{X})=\frac{p(\mathbf{X}|c)p(c)}{p(\mathbf{X})}\] \(\mathbf{X}\) is classified as the class \(C=-\) if and only if \[f_b(\mathbf{X})=\frac{p(C=+|\mathbf{X})}{p(C=-|\mathbf{X})}\leq1\] where \(f_b(\mathbf{X})\) is called a Bayesian classifier. The naive Bayes classifier is a particular Bayesian classifier that simplifies learning by assuming that the features are independent given the class: \[p(\mathbf{x}|c)=\prod_{i=1}^np(x_i|c)\] where \(\mathbf{x}=(x_1,x_2,\dots,x_n)\) is a feature vector and \(c\) is the class. One surprising aspect of this classifier is that it outperforms many more sophisticated methods, despite its assumption of conditional independence, which is rarely satisfied in real situations.
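As a minimal numeric illustration of Bayes' rule and the classifier ratio \(f_b\), here is a base-R sketch; the priors and likelihoods below are invented for illustration:

```r
# Two classes with invented priors and likelihoods p(x | c) for one observed x
prior <- c(pos = 0.4, neg = 0.6)
lik   <- c(pos = 0.30, neg = 0.05)

post <- prior * lik / sum(prior * lik)  # posterior p(c | x) by Bayes' rule
f_b  <- post[["pos"]] / post[["neg"]]   # Bayesian classifier ratio

post  # pos = 0.8, neg = 0.2
f_b   # 4 > 1, so x is classified as "+"
```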
One explanation of this behaviour was given by Domingos and Pazzani (1997): although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier can be optimal under zero-one loss (misclassification rate) even when this assumption is violated. The zero-one loss does not penalize poor probability estimates as long as the maximum probability is assigned to the correct class, so the dependencies may distort the posterior probabilities of each class without changing which class attains the maximum (posterior) probability. They also showed that the region of quadratic-loss optimality of the Bayesian classifier is a second-order infinitesimal fraction of the region of zero-one optimality, suggesting that this classifier is a good learner in a far wider range of situations than the independence assumption would indicate. For example, suppose the true probabilities are \(p(+|\mathbf{X})=0.9\) and \(p(-|\mathbf{X})=0.1\), while the naive Bayes classifier produces the estimates \(\hat{p}(+|\mathbf{X})=0.6\) and \(\hat{p}(-|\mathbf{X})=0.4\). Obviously the probability estimates are poor, but the classification (positive) is not affected.
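The two losses can be compared directly in base R using the numbers from the example above (a sketch, not part of the original analysis):

```r
p_true <- c(pos = 0.9, neg = 0.1)  # true posteriors
p_hat  <- c(pos = 0.6, neg = 0.4)  # (poor) naive Bayes estimates

# Quadratic (squared) loss on the probability estimates is substantial...
sq_loss <- sum((p_true - p_hat)^2)  # 0.18

# ...but the argmax (the predicted class) is unchanged, so zero-one loss is 0
zero_one <- as.numeric(which.max(p_hat) != which.max(p_true))  # 0
```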
Zhang (2004) argued that the explanation given by Domingos and Pazzani was not sufficient to explain why the dependencies among attributes do not make the classification less accurate. He provides an explanation in terms of how the dependencies behave: when they work together, and when they cancel each other out and do not affect the classification. Zhang uses a DAG (directed acyclic graph) to explicitly represent the dependencies among attributes, where the class node is directly linked with the feature nodes and there are also links between feature nodes (ANB, Augmented Naive Bayes), as follows:
edges <- c(1,2, 1,3, 1,4, 1,5, 4,5, 2,3, 4,3)
g <- graph(edges, n = max(edges), directed = TRUE)
plot(g)
This graph represents the following joint probability distribution \[p_G(x_1,...,x_n,c) = P(c)\prod_{i=1}^{n} P(x_i\mid pa(x_i),c)\]
where \(pa(x_i)\) denotes the parents of the node \(x_i\) in the graph, and \(c\) is the root (class) node. Given just two classes \(\{-,+\}\), for a node \(x\) its local dependence can be measured as the ratio of the conditional probability of the node given its parents and the class, to the conditional probability of the node given the class alone:
\[dd_G^+(x\mid pa(x))=\frac{p(x\mid pa(x), +)}{p(x\mid +)}\] \[dd_G^-(x\mid pa(x))=\frac{p(x\mid pa(x), -)}{p(x\mid -)}\]
Taking the ratio one can quantify the influence of \(X's\) local dependence on the classification: \[ddr_G(x)= \frac{dd_{G}^+(x \mid pa(x))}{dd_{G}^-(x \mid pa(x))}\]
This ratio leads to the following observations:

- When \(x\) has no parent, \(ddr_G(x)=1\), because \(dd_G^+=dd_G^-=1\).
- When the local dependencies in the two classes support different classifications, they partially cancel each other out, and the final classification is driven by the class with the greater local dependence. In other words, the ratio of the local dependencies ultimately determines which classification the local dependence of a node supports.
Given an ANB graph \(G\) and its corresponding naive Bayes \(G_{nb}\) (obtained by removing all arcs between features in the ANB graph), it holds that \[f_b(x_1, x_2, \dots,x_n)=f_{nb}(x_1, x_2, \dots, x_n) \cdot \prod_{i=1}^n ddr_G(x_i)\]
where \(f_b\) and \(f_{nb}\) are the Bayesian and naive Bayes classifiers respectively, and \(\displaystyle \prod_{i=1}^n ddr_G(x_i)\) is the dependence distribution factor. Under zero-one loss, \(f_b\) and \(f_{nb}\) give the same classification if and only if either \(f_b(x_1, \dots,x_n) \ge 1\) and \(\displaystyle \prod_{i=1}^n ddr_G(x_i) \le f_b(x_1, \dots,x_n)\), or \(f_b(x_1, \dots, x_n) < 1\) and \(\displaystyle\prod_{i=1}^nddr_G(x_i) > f_b(x_1, \dots, x_n)\).
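The identity above can be verified numerically on a tiny hypothetical ANB with two binary features, where \(x_1\) is a parent of \(x_2\); every probability below is invented for illustration:

```r
# Invented ANB: class -> x1, class -> x2, and x1 -> x2; equal class priors
p_x1_pos <- 0.8; p_x1_neg <- 0.3                  # p(x1 = 1 | class)
p_x2_pos_given1 <- 0.9; p_x2_pos_given0 <- 0.2    # p(x2 = 1 | x1, +)
p_x2_neg_given1 <- 0.4; p_x2_neg_given0 <- 0.7    # p(x2 = 1 | x1, -)

# Marginals p(x2 = 1 | class): what naive Bayes would use
p_x2_pos <- p_x2_pos_given1 * p_x1_pos + p_x2_pos_given0 * (1 - p_x1_pos)
p_x2_neg <- p_x2_neg_given1 * p_x1_neg + p_x2_neg_given0 * (1 - p_x1_neg)

# Classifier ratios for the instance (x1 = 1, x2 = 1); equal priors cancel
f_b  <- (p_x1_pos * p_x2_pos_given1) / (p_x1_neg * p_x2_neg_given1)
f_nb <- (p_x1_pos * p_x2_pos) / (p_x1_neg * p_x2_neg)

# x1 has no parent, so ddr(x1) = 1; the ratio of local dependences for x2:
ddr_x2 <- (p_x2_pos_given1 / p_x2_pos) / (p_x2_neg_given1 / p_x2_neg)

all.equal(f_b, f_nb * ddr_x2)  # TRUE: f_b = f_nb * prod(ddr)
# Here f_b >= 1 and ddr_x2 <= f_b, so both classifiers choose "+"
```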
References:
1) Domingos, P., and Pazzani, M. (1997). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. [online] Available at: https://www.ics.uci.edu/~pazzani/Publications/MLJ97.pdf
2) Zhang, H. (2004). The Optimality of Naive Bayes. American Association for Artificial Intelligence. [online] Available at: http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf
Part 2
The main goal of this part is to analyze a dataset with only two outcomes and to implement two different binary classification models: the LDA classifier and the naive Bayes classifier.
load('daily-sport.RData')
Take a look at the data
# Information about dataset
str(dailysport)
## 'data.frame': 30000 obs. of 46 variables:
## $ id : Factor w/ 4 levels "crosstr","jumping",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ T-xAcc : num 8.72 8.62 8.93 9.07 9.05 ...
## $ T-yAcc : num 2.14 2.05 2.05 2.08 1.98 ...
## $ T-zAcc : num 3.83 3.45 3.28 3.23 3.41 ...
## $ T-xGyro : num 0.294 0.23 0.209 0.197 0.238 ...
## $ T-yGyro : num -0.252 -0.26 -0.344 -0.241 -0.098 ...
## $ T-zGyro : num -0.1506 -0.13245 -0.10109 -0.06141 -0.00907 ...
## $ T-xMag : num -0.577 -0.586 -0.596 -0.605 -0.61 ...
## $ T-yMag : num 0.1086 0.0974 0.0896 0.0825 0.075 ...
## $ T-zMag : num -0.697 -0.69 -0.684 -0.675 -0.672 ...
## $ RA-xAcc : num 8.59 8.5 9.09 9.36 9.34 ...
## $ RA-yAcc : num 3.51 3.77 3.8 3.77 3.94 ...
## $ RA-zAcc : num 1.58 1.64 1.62 1.68 1.83 ...
## $ RA-xGyro: num 0.492 0.252 0.168 0.25 0.269 ...
## $ RA-yGyro: num -0.224 -0.381 -0.421 -0.352 -0.276 ...
## $ RA-zGyro: num 0.532 0.65 0.706 0.597 0.373 ...
## $ RA-xMag : num -0.532 -0.55 -0.574 -0.594 -0.614 ...
## $ RA-yMag : num -0.476 -0.472 -0.461 -0.452 -0.443 ...
## $ RA-zMag : num -0.594 -0.582 -0.571 -0.559 -0.545 ...
## $ LA-xAcc : num 9.22 8.47 8.49 8.43 8.08 ...
## $ LA-yAcc : num -2.77 -2.76 -3.02 -3.07 -3.59 ...
## $ LA-zAcc : num 3.06 2.72 2.51 2.54 3.06 ...
## $ LA-xGyro: num 0.301 0.0294 -0.0944 -0.6035 -1.0688 ...
## $ LA-yGyro: num -0.0141 -0.0514 -0.0233 0.1012 0.1644 ...
## $ LA-zGyro: num -0.488 -0.379 -0.342 -0.325 -0.265 ...
## $ LA-xMag : num -0.484 -0.492 -0.503 -0.511 -0.52 ...
## $ LA-yMag : num 0.727 0.717 0.712 0.709 0.711 ...
## $ LA-zMag : num -0.256 -0.262 -0.261 -0.251 -0.229 ...
## $ RL-xAcc : num -11.65 -10.9 -10.09 -8.85 -8.46 ...
## $ RL-yAcc : num 0.112 -1.163 -1.985 -0.721 0.802 ...
## $ RL-zAcc : num 0.464 1.118 1.285 1.398 1.577 ...
## $ RL-xGyro: num -1.69 -1.268 -0.476 -0.305 -0.278 ...
## $ RL-yGyro: num 0.411 0.437 0.148 0.195 0.235 ...
## $ RL-zGyro: num 1.55 1.68 1.57 1.55 1.36 ...
## $ RL-xMag : num 0.794 0.763 0.727 0.689 0.646 ...
## $ RL-yMag : num -0.427 -0.485 -0.54 -0.587 -0.636 ...
## $ RL-zMag : num 0.192 0.177 0.171 0.166 0.164 ...
## $ LL-xAcc : num -9.74 -9.53 -9.97 -9.66 -9.77 ...
## $ LL-yAcc : num -0.434 -0.495 1.045 -0.436 -0.523 ...
## $ LL-zAcc : num -1.46 -1.55 -1.27 -1.04 -1.66 ...
## $ LL-xGyro: num -0.291 -0.348 -0.417 -0.234 -0.344 ...
## $ LL-yGyro: num 0.0618 0.0619 0.0789 -0.0764 0.0258 ...
## $ LL-zGyro: num 0.32 0.291 0.297 0.478 0.407 ...
## $ LL-xMag : num 0.8 0.805 0.811 0.817 0.823 ...
## $ LL-yMag : num 0.437 0.43 0.423 0.412 0.397 ...
## $ LL-zMag : num -0.166 -0.159 -0.149 -0.144 -0.143 ...
summary(dailysport)
## id T-xAcc T-yAcc T-zAcc
## crosstr:7500 Min. :-11.526 Min. :-14.0190 Min. :-8.363
## jumping:7500 1st Qu.: 6.584 1st Qu.: -0.3961 1st Qu.: 1.420
## stepper:7500 Median : 8.751 Median : 0.3336 Median : 2.848
## walking:7500 Mean : 9.227 Mean : 0.3792 Mean : 3.029
## 3rd Qu.: 10.936 3rd Qu.: 1.3094 3rd Qu.: 3.972
## Max. : 70.835 Max. : 14.0120 Max. :33.093
## T-xGyro T-yGyro T-zGyro
## Min. :-4.696900 Min. :-9.85770 Min. :-2.1615000
## 1st Qu.:-0.330675 1st Qu.:-0.26165 1st Qu.:-0.1770200
## Median : 0.016853 Median : 0.02464 Median :-0.0009475
## Mean : 0.001798 Mean : 0.01580 Mean : 0.0014212
## 3rd Qu.: 0.359052 3rd Qu.: 0.33622 3rd Qu.: 0.1753525
## Max. : 4.530300 Max. : 5.59380 Max. : 1.9199000
## T-xMag T-yMag T-zMag RA-xAcc
## Min. :-0.9246 Min. :-0.58331 Min. :-0.8606 Min. :-25.2760
## 1st Qu.:-0.7513 1st Qu.:-0.15989 1st Qu.:-0.6281 1st Qu.: -0.1473
## Median :-0.6952 Median :-0.06689 Median :-0.4968 Median : 2.7679
## Mean :-0.7000 Mean : 0.02185 Mean :-0.4154 Mean : 3.1426
## 3rd Qu.:-0.6171 3rd Qu.: 0.25817 3rd Qu.:-0.2799 3rd Qu.: 7.7249
## Max. :-0.4142 Max. : 0.56106 Max. : 0.2717 Max. : 21.5890
## RA-yAcc RA-zAcc RA-xGyro
## Min. :-10.587 Min. :-16.557 Min. :-7.75480
## 1st Qu.: 2.839 1st Qu.: 1.847 1st Qu.:-0.43318
## Median : 4.864 Median : 3.967 Median : 0.01838
## Mean : 6.086 Mean : 4.166 Mean : 0.01600
## 3rd Qu.: 7.529 3rd Qu.: 6.468 3rd Qu.: 0.47524
## Max. : 51.797 Max. : 39.010 Max. : 9.55990
## RA-yGyro RA-zGyro RA-xMag
## Min. :-6.432800 Min. :-4.343000 Min. :-0.94995
## 1st Qu.:-0.365402 1st Qu.:-0.654562 1st Qu.:-0.46215
## Median :-0.014138 Median :-0.017570 Median :-0.22478
## Mean :-0.009334 Mean : 0.002844 Mean :-0.25362
## 3rd Qu.: 0.362628 3rd Qu.: 0.635118 3rd Qu.:-0.01952
## Max. : 4.853700 Max. : 5.177800 Max. : 0.93707
## RA-yMag RA-zMag LA-xAcc LA-yAcc
## Min. :-1.1059 Min. :-1.1374 Min. :-31.6400 Min. :-57.939
## 1st Qu.:-0.7037 1st Qu.:-0.6716 1st Qu.: 0.6757 1st Qu.: -6.833
## Median :-0.5470 Median :-0.5559 Median : 4.3583 Median : -4.362
## Mean :-0.4846 Mean :-0.5029 Mean : 4.0236 Mean : -5.387
## 3rd Qu.:-0.2958 3rd Qu.:-0.3964 3rd Qu.: 8.1350 3rd Qu.: -2.387
## Max. : 0.3295 Max. : 0.3614 Max. : 50.9240 Max. : 30.246
## LA-zAcc LA-xGyro LA-yGyro
## Min. :-23.7440 Min. :-8.833200 Min. :-8.381800
## 1st Qu.: 0.7575 1st Qu.:-0.469542 1st Qu.:-0.382228
## Median : 3.2938 Median :-0.004280 Median :-0.024567
## Mean : 2.9418 Mean :-0.007128 Mean :-0.004787
## 3rd Qu.: 6.2104 3rd Qu.: 0.426720 3rd Qu.: 0.368470
## Max. : 24.6540 Max. :18.809000 Max. : 4.358900
## LA-zGyro LA-xMag LA-yMag
## Min. :-12.935000 Min. :-0.93608 Min. :-0.9267
## 1st Qu.: -0.624395 1st Qu.:-0.49630 1st Qu.: 0.3048
## Median : 0.027278 Median :-0.03208 Median : 0.4356
## Mean : -0.008746 Mean :-0.14015 Mean : 0.3854
## 3rd Qu.: 0.622652 3rd Qu.: 0.21392 3rd Qu.: 0.5705
## Max. : 5.679100 Max. : 0.89720 Max. : 0.8499
## LA-zMag RL-xAcc RL-yAcc RL-zAcc
## Min. :-1.1028 Min. :-89.704 Min. :-52.727 Min. :-47.9250
## 1st Qu.:-0.6849 1st Qu.:-11.917 1st Qu.: -1.034 1st Qu.: -1.1442
## Median :-0.4257 Median : -9.704 Median : 1.115 Median : -0.1701
## Mean :-0.2640 Mean :-10.051 Mean : 1.084 Mean : -0.2483
## 3rd Qu.: 0.1127 3rd Qu.: -7.624 3rd Qu.: 3.362 3rd Qu.: 0.8341
## Max. : 0.7589 Max. : 3.008 Max. : 80.427 Max. : 17.0360
## RL-xGyro RL-yGyro RL-zGyro
## Min. :-5.14820 Min. :-2.081100 Min. :-3.185800
## 1st Qu.:-0.52341 1st Qu.:-0.198860 1st Qu.:-1.127500
## Median : 0.04027 Median : 0.008418 Median :-0.064018
## Mean : 0.01691 Mean : 0.012355 Mean :-0.003762
## 3rd Qu.: 0.54979 3rd Qu.: 0.219955 3rd Qu.: 1.074200
## Max. : 6.01390 Max. : 3.311100 Max. : 3.665900
## RL-xMag RL-yMag RL-zMag LL-xAcc
## Min. :0.3556 Min. :-0.7856 Min. :-0.62804 Min. :-93.940
## 1st Qu.:0.4882 1st Qu.:-0.2595 1st Qu.:-0.32134 1st Qu.:-11.757
## Median :0.6394 Median :-0.1207 Median :-0.12695 Median : -9.556
## Mean :0.6510 Mean :-0.1197 Mean :-0.01181 Mean : -9.989
## 3rd Qu.:0.8168 3rd Qu.:-0.0391 3rd Qu.: 0.31716 3rd Qu.: -7.699
## Max. :1.0715 Max. : 0.7485 Max. : 0.57962 Max. : 3.592
## LL-yAcc LL-zAcc LL-xGyro
## Min. :-95.8980 Min. :-27.1510 Min. :-7.15920
## 1st Qu.: -4.1598 1st Qu.: -1.4636 1st Qu.:-0.56632
## Median : -1.7903 Median : -0.3309 Median :-0.06929
## Mean : -1.7952 Mean : -0.4361 Mean :-0.03791
## 3rd Qu.: 0.4201 3rd Qu.: 0.7115 3rd Qu.: 0.48713
## Max. : 49.4040 Max. : 33.0620 Max. :10.38100
## LL-yGyro LL-zGyro LL-xMag
## Min. :-3.605000 Min. :-4.115000 Min. :0.2951
## 1st Qu.:-0.227410 1st Qu.:-1.116200 1st Qu.:0.4771
## Median : 0.004304 Median : 0.018875 Median :0.6287
## Mean : 0.004683 Mean :-0.006432 Mean :0.6244
## 3rd Qu.: 0.231900 3rd Qu.: 1.148025 3rd Qu.:0.7691
## Max. : 4.207600 Max. : 3.526500 Max. :1.0377
## LL-yMag LL-zMag
## Min. :-0.7177 Min. :-0.52303
## 1st Qu.: 0.1047 1st Qu.:-0.27879
## Median : 0.2789 Median : 0.01363
## Mean : 0.2626 Mean :-0.02575
## 3rd Qu.: 0.4749 3rd Qu.: 0.14591
## Max. : 1.0249 Max. : 0.62156
In this dataset there are 30000 observations of 46 variables. 45 of the 46 are measurements from 9 sensor channels (x,y,z accelerometers, x,y,z gyroscopes, x,y,z magnetometers) placed on the torso (T), right arm (RA), left arm (LA), right leg (RL) and left leg (LL) of one person performing 4 different activities (walking, stepper, cross trainer, jumping). The 46th variable is a factor identifying these 4 activities (7500 observations per activity).
Check if there are any NA values in our dataset
sum(is.na(dailysport))
## [1] 0
There are no NA values, so it is not necessary to deal with them.
The main goal is to distinguish these human activities using the data above.
Firstly, we reduce the dimension of our dataset, keeping just 2 activities (jumping and walking) and only 3 channels at one location (the x,y,z accelerometers on the torso (T)).
#small dataset
ds.small<-dailysport %>% subset(id=='walking' | id=='jumping',select=c('id','T-xAcc','T-yAcc','T-zAcc'))%>%droplevels()
colnames(ds.small)<-c('id','T_xAcc','T_yAcc','T_zAcc')
Take a look at the main properties of these three components for the two different classes
summary(ds.small %>% subset(id=='walking')%>%droplevels())
## id T_xAcc T_yAcc T_zAcc
## walking:7500 Min. : 5.252 Min. :-1.583 Min. :0.6822
## 1st Qu.: 8.062 1st Qu.: 0.299 1st Qu.:3.0790
## Median : 8.785 Median : 1.147 Median :3.5861
## Mean : 9.068 Mean : 1.144 Mean :3.4957
## 3rd Qu.:10.079 3rd Qu.: 1.939 3rd Qu.:4.0489
## Max. :15.455 Max. : 4.448 Max. :6.4017
summary(ds.small %>% subset(id=='jumping')%>%droplevels())
## id T_xAcc T_yAcc T_zAcc
## jumping:7500 Min. :-11.5260 Min. :-14.0190 Min. :-8.363
## 1st Qu.: -0.8114 1st Qu.: -0.3693 1st Qu.:-0.150
## Median : 3.4878 Median : 0.1510 Median : 1.659
## Mean : 9.0261 Mean : 0.2096 Mean : 3.638
## 3rd Qu.: 18.8015 3rd Qu.: 0.7745 3rd Qu.: 6.367
## Max. : 70.8350 Max. : 14.0120 Max. :33.093
One can see in this reduced dataset:

- For the walking class, ‘T_yAcc’ and ‘T_zAcc’ have approximately equal median and mean, whereas for ‘T_xAcc’ the two differ.
- For the jumping class, ‘T_xAcc’ and ‘T_zAcc’ have clearly different median and mean, whereas for ‘T_yAcc’ the two are approximately equal.
Let us look at these observations as a 3D point cloud in Euclidean space; to keep the plot readable, we consider just 1000 random points.
set.seed(123)
# sample 1000 points from our dataset
points<-ds.small[sample(nrow(ds.small),1000),]
# 3D scatterplot
plot_ly(points, x = ~T_xAcc, y = ~T_yAcc, z = ~T_zAcc,color=~id,colors = c('green', 'purple'),marker=list(size=4)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'x'),
yaxis = list(title = 'y'),
zaxis = list(title = 'z')))
The plot shows a high concentration of points of both classes around the origin, while towards \(x=50\) the jumping points are much more scattered: the walking points form a tight cluster that sits inside the cluster formed by the jumping points. One can therefore guess that no linear classifier can build a plane in Euclidean space that divides the two clusters perfectly; whatever linear classifier is used, it will not work well, as the following analysis shows. Firstly, the data are split into a training set and a test set, containing 70% and 30% of the observations respectively.
## 70% of the sample size
samp_size <- floor(0.70 * nrow(ds.small))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ds.small)), size = samp_size)
ds.train <- ds.small[train_ind, ]
ds.test <- ds.small[-train_ind, ]
Given these sets, we proceed to estimate:

- \(f_1(.)\) from the \(\mathbf{X_i}\) for which \(Y_i=1\), obtaining \(\hat{f_1}(.)\)
- \(f_0(.)\) from the \(\mathbf{X_i}\) for which \(Y_i=0\), obtaining \(\hat{f_0}(.)\)
- the subpopulation proportion \(\pi_1\) with \(\hat{\pi}_1=\frac{1}{n}\sum_{i=1}^nY_i\)
- the subpopulation proportion \(\pi_0\) with \(\hat{\pi}_0=\frac{1}{n}\sum_{i=1}^n(1-Y_i)=1-\hat{\pi}_1\)
and define
\[\hat{r}(\mathbf{x})=\frac{\hat{\pi_1}\hat{f_1}(x)}{\hat{\pi_1}\hat{f_1}(x)+\hat{\pi_0}\hat{f_0}(x)}\to\hat{\eta}(\mathbf{x})=\begin{cases} 1 & \quad \text{if } \hat{r}(\mathbf{x})>\frac{1}{2}\\ 0 & \quad \text{ otherwise} \end{cases}\]
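This plug-in rule can be sketched in base R with invented one-dimensional Gaussian class-conditional densities (the names f1, f0, r_hat and eta_hat are ours, not from the analysis below):

```r
# Invented class-conditional densities and equal priors
f1 <- function(x) dnorm(x, mean = 2)  # class 1
f0 <- function(x) dnorm(x, mean = 0)  # class 0
pi1 <- 0.5; pi0 <- 0.5

r_hat   <- function(x) pi1 * f1(x) / (pi1 * f1(x) + pi0 * f0(x))
eta_hat <- function(x) ifelse(r_hat(x) > 0.5, 1, 0)

eta_hat(c(-1, 0.5, 1.5, 3))  # 0 0 1 1: the boundary is the midpoint x = 1
```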
The simplest approach consists in assuming a parametric model for the class-conditional densities \(f_1(.)\) and \(f_0(.)\); here we use the LDA model.
Linear discriminant analysis
The main assumptions are:

- the conditional probability density functions \(p(\mathbf{x}|y=0)\) and \(p(\mathbf{x}|y=1)\) are both normally distributed;
- the class covariance matrices are identical (homoscedasticity assumption).
# LDA
lda.out<-lda(id ~ .,data=ds.train)
After fitting the LDA model, we can predict the class labels on the training set and on the test set
# Prediction on the train set
pred.tr = predict(lda.out, ds.train[,-1])$class
# Prediction on the test set
pred.te = predict(lda.out, ds.test[,-1])$class
After creating the confusion matrix we can evaluate the performance of the model on the train set and on the test set, calculating the empirical error rate, which is defined as \[\hat{L_n}(\eta)=\frac{1}{n}\sum_{i=1}^nI(\eta(\mathbf{X_i})\neq Y_i)\]
# Empirical error rate train set
table(pred.tr , ds.train$id)
##
## pred.tr jumping walking
## jumping 3647 1952
## walking 1584 3317
mean(pred.tr != ds.train$id)
## [1] 0.3367619
# Empirical error rate test set
table(pred.te , ds.test$id)
##
## pred.te jumping walking
## jumping 1608 834
## walking 661 1397
mean(pred.te != ds.test$id)
## [1] 0.3322222
As we can see, the results on the training and test sets are very similar, and both error rates are high (about 33%). This suggests the model underfits: as the 3D plot showed, the two classes overlap heavily, in particular around the origin, where the walking cluster sits inside the jumping cloud, so no linear boundary can separate them well.
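This nested-cluster geometry can be reproduced on synthetic data (everything below is simulated, not taken from dailysport): two classes share the same centre but have very different spreads, so LDA stays near chance, while a simple per-feature Gaussian rule that uses the class variances, a hand-rolled stand-in for naive Bayes, does much better.

```r
library(MASS)  # for lda()
set.seed(1)

n <- 2000
dat <- data.frame(
  id = factor(rep(c("inner", "outer"), each = n)),
  x  = c(rnorm(n, sd = 1), rnorm(n, sd = 5)),   # tight cluster vs ...
  y  = c(rnorm(n, sd = 1), rnorm(n, sd = 5)))   # ... dispersed cluster

# LDA: the class means coincide, so no separating plane exists
err_lda <- mean(predict(lda(id ~ ., data = dat))$class != dat$id)

# Per-feature Gaussian likelihoods with class-specific mean and variance
lik <- function(cl) {
  m <- dat[dat$id == cl, c("x", "y")]
  dnorm(dat$x, mean(m$x), sd(m$x)) * dnorm(dat$y, mean(m$y), sd(m$y))
}
err_nb <- mean(ifelse(lik("inner") > lik("outer"), "inner", "outer") != dat$id)

c(lda = err_lda, nb = err_nb)  # LDA near 0.5; variance-aware rule far lower
```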
Naive Bayes Classifier
Let us try a naive Bayes classifier, assuming that, conditionally on the class label, the feature vector \(\mathbf{X}\) has independent components, so \[f_1(x)=\prod_{j=1}^{d}f_{1,j}(x_j) \quad\text{and}\quad f_0(x)=\prod_{j=1}^{d}f_{0,j}(x_j)\] We apply the naiveBayes() function from the e1071 package, in which each numeric feature is assumed to follow a Gaussian distribution within each class. The same training and test sets are used
# type = "raw" is a predict() option, not a training option, so it is omitted
naive.Bayes.out <- naiveBayes(id ~ ., data = ds.train)
After fitting the naive Bayes model, we can predict the class labels on the training set and on the test set
# Prediction on the train set
pred.tr = predict(naive.Bayes.out, ds.train[,-1])
# Prediction on the test set
pred.te = predict(naive.Bayes.out, ds.test[,-1])
and evaluate the model using the empirical error rate
# Empirical error rate train set
table(pred.tr , ds.train$id)
##
## pred.tr jumping walking
## jumping 4784 42
## walking 447 5227
mean(pred.tr != ds.train$id)
## [1] 0.04657143
# Empirical error rate test set
table(pred.te , ds.test$id)
##
## pred.te jumping walking
## jumping 2087 19
## walking 182 2212
mean(pred.te != ds.test$id)
## [1] 0.04466667
The error rates on the training and test sets are again almost identical, but the predictions are far more accurate than LDA's: about 33% of the LDA predictions are wrong, whereas naive Bayes misclassifies only about 4%.
We now plot the estimated densities (Gaussian kernel) to see whether there are significant differences between the two activities for each variable. For each estimated density we used the bootstrap to build a pointwise 95% confidence interval, implemented as follows:

- sample from the feature with replacement;
- compute the density with a Gaussian kernel at each iteration;
- take the pointwise quantiles of the bootstrap replicates to obtain the CI.
boot.funct <- function(df, class, param, dens){
  subdf <- subset(df, id == class)[, param]
  # 1000 bootstrap replicates of the kernel density, on the same grid as 'dens'
  fit2 <- replicate(1000, {
    x <- sample(subdf, replace = TRUE)
    density(x, kernel = 'gaussian', from = min(dens$x), to = max(dens$x))$y})
  # pointwise 2.5% and 97.5% quantiles across replicates
  fit3 <- apply(fit2, 1, quantile, c(0.025, 0.975))
  return(fit3)
}
# Density
wx<-density(subset(ds.small,id=='walking')$T_xAcc,kernel = 'gaussian')
jx<-density(subset(ds.small,id=='jumping')$T_xAcc,kernel = 'gaussian')
final_x=as.data.frame(cbind(wx$x,wx$y,jx$x,jx$y))
colnames(final_x)<-c('walk_x','walk_densx','jump_x','jump_densx')
#Bootstrap starts T_xAcc walking
fitwalk_xCI<-boot.funct(ds.small,'walking','T_xAcc',wx)
#Bootstrap starts T_xAcc jumping
fitjump_xCI<-boot.funct(ds.small,'jumping','T_xAcc',jx)
layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))
plot(jx,ylim=c(0,max(max(wx$y),max(jx$y))),main = 'Kernel Density T_xAcc',xlab='x',col='red')
lines(wx,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)
#Plot CI Walking x
plot(wx, ylim = range(fitwalk_xCI),main = 'CI density T_xAcc walking',xlab='x')
polygon( c(wx$x, rev(wx$x)), c(fitwalk_xCI[1,], rev(fitwalk_xCI[2,])),col='black', border=F)
lines(wx, col = "red", lwd = 2)
#Plot CI jumping x
plot(jx, ylim = range(fitjump_xCI),main = 'CI density T_xAcc jumping',xlab='x')
polygon( c(jx$x, rev(jx$x)), c(fitjump_xCI[1,], rev(fitjump_xCI[2,])),col='black', border=F)
lines(jx, col = "red", lwd = 2)
# Density
wy<-density(subset(ds.small,id=='walking')$T_yAcc,kernel = 'gaussian')
jy<-density(subset(ds.small,id=='jumping')$T_yAcc,kernel = 'gaussian')
final_y=as.data.frame(cbind(wy$x,wy$y,jy$x,jy$y))
colnames(final_y)<-c('walk_y','walk_densy','jump_y','jump_densy')
#Bootstrap starts T_yAcc walking
fitwalk_yCI<-boot.funct(ds.small,'walking','T_yAcc',wy)
#Bootstrap starts T_yAcc jumping
fitjump_yCI<-boot.funct(ds.small,'jumping','T_yAcc',jy)
layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))
plot(jy,ylim=c(0,max(max(wy$y),max(jy$y))),main = 'Kernel Density T_yAcc',xlab='y',col='red')
lines(wy,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)
#Plot CI Walking y
plot(wy, ylim = range(fitwalk_yCI),main = 'CI density T_yAcc walking',xlab='y')
polygon( c(wy$x, rev(wy$x)), c(fitwalk_yCI[1,], rev(fitwalk_yCI[2,])),col='black', border=F)
lines(wy, col = "red", lwd = 2)
#Plot CI jumping y
plot(jy, ylim = range(fitjump_yCI),main = 'CI density T_yAcc jumping',xlab='y')
polygon( c(jy$x, rev(jy$x)), c(fitjump_yCI[1,], rev(fitjump_yCI[2,])),col='black', border=F)
lines(jy, col = "red", lwd = 2)
wz<-density(subset(ds.small,id=='walking')$T_zAcc,kernel = 'gaussian')
jz<-density(subset(ds.small,id=='jumping')$T_zAcc,kernel = 'gaussian')
final_z=as.data.frame(cbind(wz$x,wz$y,jz$x,jz$y))
colnames(final_z)<-c('walk_z','walk_densz','jump_z','jump_densz')
#Bootstrap starts T_zAcc walking
fitwalk_zCI<-boot.funct(ds.small,'walking','T_zAcc',wz)
#Bootstrap starts T_zAcc jumping
fitjump_zCI<-boot.funct(ds.small,'jumping','T_zAcc',jz)
layout(matrix(c(1,1,1,1,2,3,2,3), nrow = 4, ncol = 2, byrow = TRUE))
plot(jz,ylim=c(0,max(max(wz$y),max(jz$y))),main = 'Kernel Density T_zAcc',xlab='z',col='red')
lines(wz,col='blue')
legend('topright', legend=c("Walking", "Jumping"),col=c("blue", "red"), lty=1, cex=0.8)
#Plot CI Walking z
plot(wz, ylim = range(fitwalk_zCI),main = 'CI density T_zAcc walking',xlab='z')
polygon( c(wz$x, rev(wz$x)), c(fitwalk_zCI[1,], rev(fitwalk_zCI[2,])),col='black', border=F)
lines(wz, col = "red", lwd = 2)
#Plot CI jumping z
plot(jz, ylim = range(fitjump_zCI),main = 'CI density T_zAcc jumping',xlab='z')
polygon( c(jz$x, rev(jz$x)), c(fitjump_zCI[1,], rev(fitjump_zCI[2,])),col='black', border=F)
lines(jz, col = "red", lwd = 2)
For the variables T_xAcc and T_zAcc the distributions of “Walking” and “Jumping” are clearly different: in both cases the walking values have low variance whereas the jumping values have a much larger variance, so each of these two variables on its own already helps to distinguish the two classes. This is not the case for T_yAcc, where the two curves largely overlap.
Naive Bayes implementation
The idea is to use kernel density estimates instead of the Gaussian class-conditionals used above. The individual class-conditional densities \(\{f_{1,j}(\cdot)\}_j\) and \(\{f_{0,j}(\cdot)\}_j\) can each be estimated separately with 1-dimensional kernel density estimates and, since we assume the variables are independent given the class, the joint class-conditionals are:
\[f_1(x)=\prod_{j=1}^{d}f_{1,j}(x_j) \quad\text{and}\quad f_0(x)=\prod_{j=1}^{d}f_{0,j}(x_j)\]
#density
density_covariate <- function (column) {
  den <- density(column, kernel = 'gaussian')
  # return a function that interpolates the estimated density
  funct <- approxfun(den$x, den$y)
  return (funct)
}
#Naive Bayes
Naive_Bayes <- function (newdataset, prior_one_class, prior_second_class, densw, densj) {
  n_hat = rep(NA, nrow(newdataset))
  for (i in 1:nrow(newdataset)){
    row = newdataset[i,]
    # unnormalized posteriors: prior times the product of the 1-d kernel densities
    post_1 = prior_one_class * densw[[1]](row[[2]]) * densw[[2]](row[[3]]) * densw[[3]](row[[4]])
    post_2 = prior_second_class * densj[[1]](row[[2]]) * densj[[2]](row[[3]]) * densj[[3]](row[[4]])
    r_hat = post_1/(post_1 + post_2)
    # approxfun() returns NA outside the estimated grid: default to "jumping" then
    n_hat[i] = ifelse(is.na(r_hat) || r_hat < 0.5, "jumping", "walking")
  }
  return (n_hat)
}
ds.w <- subset(ds.train, id=='walking',select = c('T_xAcc','T_yAcc','T_zAcc'))
ds.j <- subset(ds.train, id=='jumping',select = c('T_xAcc','T_yAcc','T_zAcc'))
dens.w<-sapply(ds.w,density_covariate)
dens.j<-sapply(ds.j,density_covariate)
prior_one_class <- nrow(ds.w)/length(ds.train$id)
prior_second_class <- nrow(ds.j)/length(ds.train$id)
result <- Naive_Bayes (ds.test, prior_one_class, prior_second_class,dens.w,dens.j)
result2 <- Naive_Bayes (ds.train, prior_one_class, prior_second_class,dens.w,dens.j)
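One detail worth noting about density_covariate(): approxfun() returns NA for points outside the grid on which the density was estimated, which is why Naive_Bayes() falls back to "jumping" when it meets an NA. A self-contained base-R illustration:

```r
set.seed(1)
den <- density(rnorm(500), kernel = "gaussian")  # KDE on a simulated sample
f <- approxfun(den$x, den$y)                     # interpolating function

f(0)     # a positive density value: 0 lies inside the estimated grid
f(100)   # NA: 100 lies outside the grid, so no interpolation is possible
```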
Let us evaluate our implementation of the naive Bayes classifier and calculate the error rate:
# Empirical error rate train set
table(result2 , ds.train$id)
##
## result2 jumping walking
## jumping 4800 16
## walking 431 5253
mean(result2 != ds.train$id)
## [1] 0.04257143
# Empirical error rate test set
table(result , ds.test$id)
##
## result jumping walking
## jumping 2093 6
## walking 176 2225
mean(result != ds.test$id)
## [1] 0.04044444
The results of our implementation are better than those of the Gaussian naive Bayes in both the training and the test set. In particular, the number of walking observations misclassified as jumping drops to 16 and 6 for the training and test set respectively, versus 42 and 19 before. The classifier is more precise because the kernel density approximations of the class-conditional distributions are less biased than the Gaussian fits, so the error is lower than in the previous case.
Part 3
Now we analyze the whole dataset and try to predict all four activities instead of just two. We use the LDA method to see whether a linear classifier can identify all the activities.
## 70% of the sample size
samp_size <- floor(0.70 * nrow(dailysport))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(dailysport)), size = samp_size)
train <- dailysport[train_ind, ]
test <- dailysport[-train_ind, ]
# LDA
lda.out<-lda(id ~ .,data=train)
# Prediction on the test set
pred.te = predict(lda.out, test[,-1])$class
# Empirical error rate test set
table(pred.te , test$id)
##
## pred.te crosstr jumping stepper walking
## crosstr 2261 0 0 0
## jumping 0 2225 0 0
## stepper 0 0 2264 0
## walking 0 0 0 2250
mean(pred.te != test$id)
## [1] 0
From the result we can say that the 4 groups are well separated, so a linear classifier is enough to distinguish the 4 classes. To support this, we also fit gradient boosted trees with the xgboost package.
ds.origin = dailysport
dailysport$id <- as.numeric(dailysport$id)
dailysport$id <- dailysport$id - 1
dailysportM <- as.matrix(dailysport[,-1])
data_label <- dailysport[,"id"]
# DMatrix over the features only (the id label must not appear among the features)
dailysportXGB <- xgb.DMatrix(data = dailysportM, label = data_label)
## 70% of the sample size
samp_size <- floor(0.70 * nrow(dailysport))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(dailysport)), size = samp_size)
train <- dailysportM[train_ind, ]
train_label <- data_label[train_ind]
train_matrix <- xgb.DMatrix(data = train, label = train_label)
test <- dailysportM[-train_ind, ]
test_label <- data_label[-train_ind]
test_matrix <- xgb.DMatrix(data = test, label = test_label)
n_class = length(unique(dailysport$id))
params = list('objective' = 'multi:softprob','eval_metric' = 'merror','num_class' = n_class)
nround <- 50
cv.nfold = 5
cv_model <- xgb.cv(params = params,
data = train_matrix,
nrounds = nround,
nfold = cv.nfold,
verbose = FALSE,
prediction = TRUE)
OOF_prediction <- data.frame(cv_model$pred) %>%
mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)
head(OOF_prediction)
## X1 X2 X3 X4 max_prob label
## 1 2.565523e-05 2.206023e-05 9.999272e-01 2.508805e-05 3 3
## 2 2.225092e-05 9.999025e-01 2.123043e-05 5.406532e-05 2 2
## 3 2.463430e-04 2.193917e-05 9.997148e-01 1.690017e-05 3 3
## 4 3.986130e-05 9.999193e-01 2.169082e-05 1.915286e-05 2 2
## 5 2.105451e-05 9.999385e-01 2.151386e-05 1.891940e-05 2 2
## 6 2.041578e-05 2.406958e-05 2.086117e-05 9.999347e-01 4 4
confusionMatrix(factor(OOF_prediction$max_prob),
factor(OOF_prediction$label),
mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 5239 0 0 0
## 2 0 5275 0 0
## 3 0 0 5236 0
## 4 0 0 0 5250
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2512
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.0000 1.0000 1.0000 1.00
## Specificity 1.0000 1.0000 1.0000 1.00
## Pos Pred Value 1.0000 1.0000 1.0000 1.00
## Neg Pred Value 1.0000 1.0000 1.0000 1.00
## Precision 1.0000 1.0000 1.0000 1.00
## Recall 1.0000 1.0000 1.0000 1.00
## F1 1.0000 1.0000 1.0000 1.00
## Prevalence 0.2495 0.2512 0.2493 0.25
## Detection Rate 0.2495 0.2512 0.2493 0.25
## Detection Prevalence 0.2495 0.2512 0.2493 0.25
## Balanced Accuracy 1.0000 1.0000 1.0000 1.00
### test
bst_model <- xgb.train(params = params,
data = train_matrix,
nrounds = nround)
# Predict hold-out test set
test_pred <- predict(bst_model, newdata = test_matrix)
test_prediction <- matrix(test_pred, nrow = n_class,
ncol=length(test_pred)/n_class) %>%
t() %>%
data.frame() %>%
mutate(label = test_label + 1,
max_prob = max.col(., "last"))
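With the multi:softprob objective, predict() returns a single flat vector of length n_obs * n_class, with each observation's class probabilities stored consecutively; the matrix()/t() step above reshapes it into an observations-by-classes matrix. A minimal sketch with invented probabilities:

```r
n_class <- 4
flat <- c(0.70, 0.10, 0.10, 0.10,   # observation 1: probabilities of classes 1..4
          0.05, 0.05, 0.80, 0.10)   # observation 2
probs <- t(matrix(flat, nrow = n_class))  # same reshaping as in the pipeline

dim(probs)               # 2 x 4: one row per observation
max.col(probs, "last")   # predicted class indices: 1 and 3
```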
# confusion matrix of test set
confusionMatrix(factor(test_prediction$max_prob),
factor(test_prediction$label),
mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2261 0 0 0
## 2 0 2225 0 0
## 3 0 0 2264 0
## 4 0 0 0 2250
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9996, 1)
## No Information Rate : 0.2516
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.0000 1.0000 1.0000 1.00
## Specificity 1.0000 1.0000 1.0000 1.00
## Pos Pred Value 1.0000 1.0000 1.0000 1.00
## Neg Pred Value 1.0000 1.0000 1.0000 1.00
## Precision 1.0000 1.0000 1.0000 1.00
## Recall 1.0000 1.0000 1.0000 1.00
## F1 1.0000 1.0000 1.0000 1.00
## Prevalence 0.2512 0.2472 0.2516 0.25
## Detection Rate 0.2512 0.2472 0.2516 0.25
## Detection Prevalence 0.2512 0.2472 0.2516 0.25
## Balanced Accuracy 1.0000 1.0000 1.0000 1.00
# get the feature real names
names <- colnames(dailysport[,-1])
# compute feature importance matrix
importance_matrix = xgb.importance(feature_names = names, model = bst_model)
gp = xgb.ggplot.importance(importance_matrix)
print(gp)
As we can see, the predictions are the same: the points are well separated, as we assumed above. The importance plot shows the variables most used to split the data. To check our assumption, we plot the 3 most important variables.
set.seed(123)
#small dataset
ds<-ds.origin %>% subset(select=c('id','LA-xMag','T-yMag','RL-zMag'))%>%droplevels()
colnames(ds)<-c('id','LA_xMag','T_yMag','RL_zMag')
# take all 30000 points in random order
points<-ds[sample(nrow(ds),30000),]
# 3D scatterplot
plot_ly(points, x = ~LA_xMag, y = ~T_yMag, z = ~RL_zMag,color=~id,colors = c('green', 'purple','red','blue'),marker=list(size=4)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'x'),
yaxis = list(title = 'y'),
zaxis = list(title = 'z')))
As we expected, the classes are well separated.